MAPPING TEXTS: COMBINING TEXT-MINING AND GEO-VISUALIZATION TO UNLOCK THE RESEARCH POTENTIAL OF HISTORICAL NEWSPAPERS A White Paper for the National Endowment for the Humanities

نویسندگان

  • Andrew J. Torget
  • Rada Mihalcea
  • Jon Christensen
  • Geoff McGhee
چکیده

In this paper, we explore the task of automatic text processing applied to collections of historical newspapers, with the aim of assisting historical research. In particular, in this first stage of our project, we experiment with the use of topical models as a means to identify potential issues of interest for historians. 1 Newspapers in Historical Research Surviving newspapers are among the richest sources of information available to scholars studying peoples and cultures of the past 250 years, particularly for research on the history of the United States. Throughout the nineteenth and twentieth centuries, newspapers served as the central venues for nearly all substantive discussions and debates in American society. By the mid-nineteenth century, nearly every community (no matter how small) boasted at least one newspaper. Within these pages, Americans argued with one another over politics, advertised and conducted economic business, and published articles and commentary on virtually all aspects of society and daily life. Only here can scholars find editorials from the 1870s on the latest political controversies, advertisements for the latest fashions, articles on the latest sporting events, and languid poetry from a local artist, all within one source. Newspapers, in short, document more completely the full range of the human experience than nearly any other source available to modern scholars, providing windows into the past available nowhere else. Despite their remarkable value, newspapers have long remained among the most underutilized historical resources. The reason for this paradox is quite simple: the sheer volume and breadth of information available in historical newspapers has, ironically, made it extremely difficult for historians to go through them page-by-page for a given research project. A historian, for example, might need to wade through tens of thousands of newspaper pages in order to answer a single research question (with no guarantee of stumbling onto the necessary information). Recently, both the research potential and problem of scale associated with historical newspapers has expanded greatly due to the rapid digitization of these sources. The National Endowment for the Humanities (NEH) and the Library of Congress (LOC), for example, are sponsoring a nationwide historical digitization project, Chronicling America, geared toward digitizing all surviving historical newspapers in the United States, from 1836 to the present. This project recently digitized its one millionth page (and they project to have more than 20 million pages within a few years), opening a vast wealth of historical newspapers in digital form. While projects such as Chronicling America have indeed increased access to these important sources, they have also increased the problem of scale that have long prevent scholars from using these sources in meaningful ways. Indeed, without tools and methods capable of handling such large datasets – and thus sifting out meaningful patterns embedded within them – scholars find themselves confined to performing only basic word searches across enormous collections. These simple searches can, indeed, find stray information scattered in unlikely 46 places. Such rudimentary search tools, however, become increasingly less useful to researchers as datasets continue to grow in size. If a search for a particular term yields 4,000,000 results, even those search results produce a dataset far too large for any single scholar to analyze in a meaningful way using traditional methods. The age of abundance, it turns out, can simply overwhelm historical scholars, as the sheer volume of available digitized historical newspapers is beginning to do. In this paper, we explore the use of topic modeling, in an attempt to identify the most important and potentially interesting topics over a given period of time. Thus, instead of asking a historian to look through thousands of newspapers to identify what may be interesting topics, we take a reverse approach, where we first automatically cluster the data into topics, and then provide these automatically identified topics to the historian so she can narrow her scope to focus on the individual patterns in the dataset that are most applicable to her research. Of more utility would be where the modeling would reveal unexpected topics that point towards unusual patterns previously unknown, thus help shaping a scholar’s subsequent research. The topic modeling can be done for any periods of time, which can consist of individual years or can cover several years at a time. In this way, we can see the changes in the discussions and topics of interest over the years. Moreover, pre-filters can also be applied to the data prior to the topic modeling. For instance, since research being done in the History department at our institution is concerned with the “U. S. cotton economy,” we can use the same approach to identify the interesting topics mentioned in the news articles that talk about the issue of “cotton.” 2 Topic Modeling Topic models have been used by Newman and Block (2006) and Nelson (2010) on newspaper corpora to discover topics and trends over time. The former used the probabilistic latent semantic analysis (pLSA) model, and the latter used the latent Dirichlet allocation (LDA) model, a method introduced by Blei et al. (2003). LDA has also been used by Griffiths and Steyvers (2004) to find research topic trends by looking at abstracts of scientific papers. Hall et al. (2008) have similarly applied LDA to discover trends in the computational linguistics field. Both pLSA and LDA models are probabilistic models that look at each document as a mixture of multinomials or topics. The models decompose the document collection into groups of words representing the main topics. See for instance Table 1, which shows two topics extracted from our collection. Topic worth price black white goods yard silk made ladies wool lot inch week sale prices pair suits fine quality state states bill united people men general law government party made president today washington war committee country public york Table 1: Example of two topic groups Boyd-Graber et al. (2009) compared several topic models, including LDA, correlated topic model (CTM), and probabilistic latent semantic indexing (pLSI), and found that LDA generally worked comparably well or better than the other two at predicting topics that match topics picked by the human annotators. We therefore chose to use a parallel threaded SparseLDA implementation to conduct the topic modeling, namely UMass Amherst’s MAchine Learning for LanguagE Toolkit (MALLET) (McCallum, 2002). MALLET’s topic modeling toolkit has been used by Walker et al. (2010) to test the effects of noisy optical character recognition (OCR) data on LDA. It has been used by Nelson (2010) to mine topics from the Civil War era newspaper Dispatch, and it has also been used by Blevins (2010) to examine general topics and to identify emotional moments from Martha Ballards Diary. 3 Dataset Our sample data comes from a collection of digitized historical newspapers, consisting of newspapers published in Texas from 1829 to 2008. Issues are segmented by pages with continuous text containing articles and advertisements. Table 2 provides more information about the dataset. 2 http://mallet.cs.umass.edu/ 1 http://americanpast.richmond.edu/dispatch/ 3 http://historying.org/2010/04/01/ 47 Property Number of titles Number of years Number of issues Number of pages Number of tokens 114 180 32,745 232,567 816,190,453 Table 2: Properties of the newspaper collection 3.1 Sample Years and Categories From the wide range available, we sampled several historically significant dates in order to evaluate topic modeling. These dates were chosen for their unique characteristics (detailed below), which made it possible for a professional historian to examine and evaluate the relevancy of the results. These are the subcategories we chose as samples: • Newspapers from 1865-1901: During this period, Texans rebuilt their society in the aftermath of the American Civil War. With the abolition of slavery in 1865, Texans (both black and white) looked to rebuild their post-war economy by investing heavily in cotton production throughout the state. Cotton was considered a safe investment, and so Texans produced enough during this period to make Texas the largest cotton producer in the United States by 1901. Yet overproduction during that same period impoverished Texas farmers by driving down the market price for cotton, and thus a large percentage went bankrupt and lost their lands (over 50 percent by 1900). As a result, angry cotton farmers in Texas during the 1890s joined a new political party, the Populists, whose goal was to use the national government to improve the economic conditions of farmers. This effort failed by 1896, although it represented one of the largest third-party political revolts in American history. This period, then, was dominated by the rise of cotton as the foundation of the Texas economy, the financial failures of Texas farmers, and their unsuccessful political protests of the 1890s as cotton bankrupted people across the state. These are the issues we would expect to emerge as important topics from newspapers in this category. This dataset consists of 52,555 pages over 5,902 issues. • Newspapers from 1892: This was the year of the formation of the Populist Party, which a large portion of Texas farmers joined for the U. S. presidential election of 1892. The Populists sought to have the U. S. federal government become actively involved in regulating the economy in places like Texas (something never done before) in order to prevent cotton farmers from going further into debt. In the 1892 election, the Populists did surprisingly well (garnering about 10 percent of the vote nationally) and won a full 23 percent of the vote in Texas. This dataset consists of 1,303 pages over 223 issues. • Newspapers from 1893: A major economic depression hit the United States in 1893, devastating the economy in every state, including Texas. This exacerbated the problem of cotton within the states economy, and heightened the efforts of the Populists within Texas to push for major political reforms to address these problems. What we see in 1893, then, is a great deal of stress that should exacerbate trends within Texas society of that year (and thus the content of the newspapers). This dataset consists of 3,490 pages over 494 issues. • Newspapers from 1929-1930: These years represented the beginning and initial onset in the United States of the Great Depression. The United States economy began collapsing in October 1929, when the stock market crashed and began a series of economic failures that soon brought down nearly the entire U. S. economy. Texas, with its already shaky economic dependence on cotton, was as devastated as any other state. As such, this period was marked by discussions about how to save both the cotton economy of Texas and about possible government intervention into the economy to prevent catastrophe. This dataset consists of 6,590 pages over 973 issues. Throughout this era, scholars have long recognized that cotton and the economy were the dominating issues. Related to that was the rise and fall

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Trading Consequences: A Case Study of Combining Text Mining and Visualization to Facilitate Document Exploration

Large-scale digitization efforts and the availability of computational methods, including text mining and information visualization, have enabled new approaches to historical research. However, we lack case studies of how these methods can be applied in practice and what their potential impact may be. Trading Consequences is an interdisciplinary research project between environmental historians...

متن کامل

Trading Consequences: A Case Study of Combining Text Mining & Visualisation to Facilitate Document Exploration

Trading Consequences is an interdisciplinary research project between historians, computational linguists and visualization specialists. We use text mining and visualisations to explore the growth of the global commodity trade in the nineteenth century. Feedback from a group of environmental historians during a workshop provided essential information to adapt advanced text mining and visualisat...

متن کامل

A clustering approach for mineral potential mapping: A deposit-scale porphyry copper exploration targeting

This work describes a knowledge-guided clustering approach for mineral potential mapping (MPM), by which the optimum number of clusters is derived form a knowledge-driven methodology through a concentration-area (C-A) multifractal analysis. To implement the proposed approach, a case study at the North Narbaghi region in the Saveh, Markazi province of Iran, was investigated to discover porphyry ...

متن کامل

Visual analysis and exploration of ancient texts with a user-driven concept search

In the last decades, the amount of digital data has grown rapidly. This also impacted the humanities field where the humanists and computer scientists collaborate to digitize historical texts for preservation purpose. Due to this effort, many historical texts that were accessible to few scholars became widely available in form of digital texts. The focus shifted from printed to digital texts. T...

متن کامل

Approximate resistivity and susceptibility mapping from airborne electromagnetic and magnetic data, a case study for a geologically plausible porphyry copper unit in Iran

This paper describes the application of approximate methods to invert airborne magnetic data as well as helicopter-borne frequency domain electromagnetic data in order to retrieve a joint model of magnetic susceptibility and electrical resistivity. The study area located in Semnan province of Iran consists of an arc-shaped porphyry andesite covered by sedimentary units which may have potential ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012